Red Wine Quality by Anav Gupta

Packages

Before we start exploring the data, we will first load the packages that we will need for this exploration.


Load the Data

Let’s load the data necessary for this exploration.

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : Factor w/ 10 levels "1","2","3","4",..: 5 5 5 6 5 5 5 7 7 5 ...

This tidy data set contains 1,599 red wines with 11 variables on the chemical properties of the wine. At least 3 wine experts rated the quality of each wine, providing a rating between 0 (very bad) and 10 (very excellent).


Univariate Plots Section

In this section we will analyze each variable individually.

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##                                                                   
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##                                                        
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##                                                                        
##     alcohol         quality     quality.num    quality.group
##  Min.   : 8.40   5      :681   Min.   :3.000   (2,4]:  63   
##  1st Qu.: 9.50   6      :638   1st Qu.:5.000   (4,6]:1319   
##  Median :10.20   7      :199   Median :6.000   (6,8]: 217   
##  Mean   :10.42   4      : 53   Mean   :5.636                
##  3rd Qu.:11.10   8      : 18   3rd Qu.:6.000                
##  Max.   :14.90   3      : 10   Max.   :8.000                
##                  (Other):  0

Quality

Now let’s have a look at the quality variable.

Most of the wines in this sample (1,319 of 1,599) have a quality score of 5 or 6.


Fixed.Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90
## 
##   (4,5]   (5,6]   (6,7]   (7,8]   (8,9]  (9,10] (10,12] (12,16] 
##       9      62     291     496     300     188     194      59
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##  4.6  6.5  7.0  7.2  7.6  7.9  8.3  8.9  9.7 10.7 15.9
##    90%    91%    92%    93%    94%    95%    96%    97%    98%    99% 
## 10.700 10.900 11.100 11.400 11.512 11.800 12.000 12.400 12.700 13.300 
##   100% 
## 15.900

The distribution of fixed acidity peaks at around 7 to 8 units (496 wines), and most wines fall between 6 and 9 units. The upper percentiles (a 99th percentile of 13.3 against a maximum of 15.9) show that the data set contains a few outliers. It seems appropriate that such high values are rare, as higher fixed acidity makes the wine more acidic.


Volatile Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
## 
## (0.1,0.3] (0.3,0.4] (0.4,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8]   (0.8,1] 
##       143       302       314       336       281       119        83 
##   (1,1.6] 
##        21
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 0.120 0.310 0.370 0.415 0.470 0.520 0.570 0.610 0.660 0.745 1.580
##    90%    91%    92%    93%    94%    95%    96%    97%    98%    99% 
## 0.7450 0.7609 0.7800 0.7850 0.8200 0.8400 0.8700 0.9000 0.9600 1.0200 
##   100% 
## 1.5800

High levels of volatile acidity can lead to an unpleasant, vinegar-like taste. This seems to be the reason why only a small percentage of wines have high volatile acidity. The most common levels lie between 0.4 and 0.6.


Citric Acid

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

It seems that citric acid is not present in many of the wine samples. The citric acid distribution is quite flat. It will be interesting to explore the quality of the wines that don’t contain any citric acid.


Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

It seems that the majority of the wines have a residual sugar level between 1 and 3. The graph above also shows some wine samples with far more residual sugar (up to 15.5). It will be interesting to compare the quality of wines based on their residual sugar.


Chlorides

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

Chlorides are found in very small quantities in the wine samples. The distribution looks roughly normal around its peak but has many outliers on the high side. About 95 percent of the wines contain chlorides in the range of 0.040 to 0.125.


Free Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1    5    6    9   11   14   16   19   24   31   72
##   90%   91%   92%   93%   94%   95%   96%   97%   98%   99%  100% 
## 31.00 31.00 32.00 33.00 34.00 35.00 37.00 39.00 42.00 50.02 72.00

From the graph as well as from the quantile output, we can see that in the majority of the wines (90 percent) the amount of free sulfur dioxide is within 31 units. At low concentrations sulfur dioxide is undetectable, but at free concentrations over 50 ppm it becomes evident in the nose as well as the taste of the wine. I suppose this is why only about 1 percent of the wines have a concentration greater than 50.
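The decile and upper-percentile tables above are the kind of output that R's quantile() function produces. Here is a minimal sketch of such calls, assuming only the first ten free sulfur dioxide values visible in the data listing (not the full 1,599 rows); the vector name free_so2 is ours and not part of the original analysis.

```r
# First ten free.sulfur.dioxide values from the data listing (illustrative subset only)
free_so2 <- c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17)

# Deciles: 0%, 10%, ..., 100%
deciles <- quantile(free_so2, probs = seq(0, 1, by = 0.1))

# Upper tail in 1% steps: 90%, 91%, ..., 100%
upper_tail <- quantile(free_so2, probs = seq(0.9, 1, by = 0.01))

print(deciles)
print(upper_tail)
```

Run on the full column, the same two calls would give tables in the shape shown above.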


Total Sulfur Dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   6.0  14.0  19.0  24.0  30.0  38.0  45.8  55.0  69.0  93.2 289.0

Total sulfur dioxide is the total amount of sulfur dioxide in the wine, so we expect some kind of relation between free and total sulfur dioxide.


Density

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

We can clearly see that the density of the wine varies over a narrow range. The median and mean of the density are nearly equal (0.9968 and 0.9967), which suggests a roughly symmetric distribution, consistent with a normal shape.


pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

pH is an index that indicates the acidity or alkalinity of a water-soluble substance. The pH of the wine varies over a narrow range of about 3 to 4. The median and mean of pH are nearly equal (3.310 and 3.311), which suggests a roughly symmetric distribution, consistent with a normal shape.


Sulphates

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
## 0.330 0.500 0.540 0.564 0.590 0.620 0.650 0.700 0.760 0.850 2.000
##    90%    91%    92%    93%    94%    95%    96%    97%    98%    99% 
## 0.8500 0.8600 0.8700 0.8900 0.9100 0.9300 0.9708 1.0500 1.1300 1.2604 
##   100% 
## 2.0000
## 
##   (0,0.5] (0.5,0.6] (0.6,0.7] (0.7,0.8] (0.8,0.9]   (0.9,1]     (1,2] 
##       178       545       410       237       127        44        58

From the histogram as well as from the table, we can see that the majority of the wines have a sulphate concentration of 0.5 to 0.7 units.


Alcohol

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 
##   (8,9]  (9,10] (10,11] (11,12] (12,13] (13,15] 
##      37     710     444     267     118      23

From the histogram as well as the table summary, we can see that about 44 percent of the wines (710 of 1,599) have an alcohol content of 9 to 10 percent, making it the most common range.
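To make the share of each alcohol range explicit, the bin counts from the table above can be turned into percentages. This is a small sketch that only reuses the printed counts; the names alcohol_bins and bin_share are ours.

```r
# Bin counts copied from the alcohol table above
alcohol_bins <- c("(8,9]" = 37, "(9,10]" = 710, "(10,11]" = 444,
                  "(11,12]" = 267, "(12,13]" = 118, "(13,15]" = 23)

# Share of wines in each bin, in percent (the counts sum to all 1,599 wines)
bin_share <- round(100 * alcohol_bins / sum(alcohol_bins), 1)

print(bin_share)
```

The (9,10] bin is the largest, at a little under half of the sample.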


Univariate Analysis

In this section we will list the analysis of the univariate exploration.

What is the structure of your dataset?

Our data set consists of 1,599 observations with 11 physicochemical inputs and one output that gives the quality of the wine. The quality variable is a factor with the following levels, listed from worst to best:

(Worst) ----> (Best) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Other Observations:

  • Most wines have a pH of 3 to 4.
  • The most common alcohol content is 9 to 10 percent.

What is/are the main feature(s) of interest in your dataset?

We can see that the variables free sulfur dioxide and total sulfur dioxide will be connected in some way. In the same manner, fixed acidity and total acidity are connected to each other. It will be interesting to find out whether the quantity of alcohol in the wine has any influence on its quality. Another question is how the quantity of citric acid, which adds freshness and flavor to wines, affects the quality of the wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

Chlorides, residual sugar and pH are some of the variables that will support my investigation into my features of interest.

Did you create any new variables from existing variables in the dataset?

Yes, I created new variables out of the existing quality variable. First, I converted quality into a factor (a numeric version is kept as ‘quality.num’). Second, I created a variable ‘quality.group’ by cutting quality into three equal-width intervals: (2,4], (4,6] and (6,8].
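The derived variables described above could be built roughly as follows. This is a sketch under assumptions: it uses only the first ten quality scores visible in the data listing, and the break points c(2, 4, 6, 8) are inferred from the interval labels in the summary output, not taken from the original code.

```r
# First ten quality scores from the data listing (illustrative subset only)
quality_num <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)

# Factor version with levels 1 to 10, matching the "Factor w/ 10 levels" output
quality_fct <- factor(quality_num, levels = 1:10)

# Three equal-width groups; the breaks are an assumption that
# reproduces the (2,4], (4,6], (6,8] labels seen in the summary
quality_group <- cut(quality_num, breaks = c(2, 4, 6, 8))

print(table(quality_group))
```

Because cut() uses right-closed intervals by default, a score of 6 lands in (4,6] and a score of 7 in (6,8].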

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

I converted quality into a factor variable. This helps me visualize how the input variables differ across wines of different quality.

Bivariate Plots Section

This section will try to explore two variables at a time.

From the above graph we can see notable correlations among the following variables:

  • pH and Fixed Acidity: -0.683
  • Fixed Acidity and Citric Acid: 0.672
  • Density and Fixed Acidity: 0.668
  • Free Sulfur Dioxide and Total Sulfur Dioxide: 0.668
  • Volatile Acidity and Citric Acid: -0.552
  • Citric Acid and pH: -0.542
  • Density and Alcohol: -0.5
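Pairwise correlations like these come from a correlation matrix. As a hedged illustration, the sketch below computes one with cor() on the first ten rows visible in the data listing, for four columns only. The data frame name wine_head is ours, and with so few rows the numbers will differ from the full-data values quoted above.

```r
# First ten rows of four columns, copied from the data listing (illustrative subset only)
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  citric.acid   = c(0, 0, 0.04, 0.56, 0, 0, 0.06, 0, 0.02, 0.36),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Pearson correlation between every pair of columns
cor_matrix <- round(cor(wine_head), 3)

print(cor_matrix)
```

Even in this tiny subset, fixed acidity is negatively related to pH and positively related to citric acid, the same signs as in the full data.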

Now let’s compare each physicochemical input of the wine with its quality (the output).

Volatile Acidity and Quality

From the above box plot, we can make some connection between volatile acidity and the quality of the wine. In lower-quality wines, volatile acidity is widely dispersed, and the dispersion shrinks as we move to better-quality wines. Both the median and the mean of volatile acidity decrease as quality increases. From the box plot it seems that 0.3 to 0.5 is the ideal range for volatile acidity.
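The per-quality medians and means that the box plot summarizes can also be computed directly. This is a minimal sketch using tapply() on the first ten rows from the data listing; the vector names are ours, and ten rows are far too few to reproduce the trend itself.

```r
# First ten rows from the data listing (illustrative subset only)
quality          <- c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
volatile_acidity <- c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5)

# Median and mean volatile acidity within each quality level
va_median <- tapply(volatile_acidity, factor(quality), median)
va_mean   <- tapply(volatile_acidity, factor(quality), mean)

print(va_median)
print(va_mean)
```

Applied to the full data set, the same grouping would give one median and one mean for each of the six observed quality levels.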


Fixed Acidity and Quality

Fixed Acidity for the wines having quality 3 or 4 is very dispersed. For the wines with quality score 5, 6 and 7 we can see from the a

It’s difficult to spot a trend in the above boxplots. It can be said, though, that for each quality level the middle 50 percent of fixed acidity values lies roughly between 7 and 10 units.


Citric Acid and Quality

## # A tibble: 6 x 2
##   quality     n
##    <fctr> <int>
## 1       3     7
## 2       4    43
## 3       5   624
## 4       6   584
## 5       7   191
## 6       8    18

The trend we can see is that the citric acid content (mean as well as median) increases with the quality of the wine.


pH and Quality

The boxplot does not tell us much about the quality of the wine from its pH value. What it does show is that the pH of the wine generally remains within the range of 3 to 4, and about 50 percent of the time within 3.2 to 3.4.


Residual Sugar vs Quality

From the boxplot above, there does not seem to be any relationship or trend between residual sugar content and the quality of the wine. About 87 percent of the time, the residual sugar falls within the range of 1 to 3. The median residual sugar content remains roughly constant across all qualities of wine. Some wines with quality 5 and 6 have a high amount of residual sugar. For all qualities of wine, residual sugar stays within the range of 2 to 3 more than 50 percent of the time.


Chlorides vs Quality

There doesn’t seem to be any trend between the quantity of chlorides and the quality of the wine. For most of the wines, the quantity of chlorides falls between 0.05 and 0.1 units.


Free Sulfur Dioxide vs Quality

From the boxplot above, free sulfur dioxide does not seem to have any effect on the quality of the wine.


Total Sulfur Dioxide vs Quality

Total sulfur dioxide doesn’t seem to have any relationship with the quality of the wine. It just so happens that in more than 50 percent of the wines the total sulfur dioxide is less than or equal to 50.


Sulphates vs Quality

Both the median and the mean sulphate content increase with the quality of the wine.


Density vs Quality

As we have already seen, the density of the wine varies over a narrow range. It’s difficult to find a trend between the density and the quality of a wine.


Alcohol vs Quality

The above boxplots seem to suggest that as the alcohol content of the wine increases, its quality increases. However, this cannot be said with certainty, since some wines of quality 5 also have a high alcohol content.


Now that we have compared the physicochemical properties of wine with its quality, let’s try to relate the physicochemical properties to each other.

Fixed Acidity and Citric Acid

We know that citric acid is a non-volatile acid and that fixed acidity measures the non-volatile acid content of the wine.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and citric.acid
## t = 36.234, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6438839 0.6977493
## sample estimates:
##       cor 
## 0.6717034

We see a fairly strong positive correlation (r ≈ 0.67) between citric acid and fixed acidity, which was somewhat expected.
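The t statistic and the confidence interval in the test output follow directly from the correlation coefficient and the sample size. As a check, here is a short sketch that recomputes them from the reported r using the standard formulas (the t transform of r and the Fisher z interval); the variable names are ours.

```r
# Values reported in the test output above
r <- 0.6717034   # sample correlation
n <- 1599        # number of wines

# t statistic for testing whether the true correlation is 0, with n - 2 degrees of freedom
deg_free <- n - 2
t_stat <- r * sqrt(deg_free) / sqrt(1 - r^2)

# 95 percent confidence interval via the Fisher z transform
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

print(round(t_stat, 3))
print(round(ci, 4))
```

The results agree with the printed t = 36.234 and the interval of roughly 0.644 to 0.698.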


Fixed Acidity and pH

In the graph, it seems that the smooth line is going through the middle of the major portion of points.

## 
##  Pearson's product-moment correlation
## 
## data:  fixed.acidity and pH
## t = -37.366, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7082857 -0.6559174
## sample estimates:
##        cor 
## -0.6829782

From the above we can say that fixed acidity and pH are negatively correlated. This seems plausible as well: as the acid content of the wine increases, the pH value, which measures how acidic or alkaline a solution is, should decrease. The lower the pH, the more acidic the substance.


Density and Fixed Acidity

The data is dispersed in the upper half of the smooth line. I guess we do see some correlation, and the smooth line somewhat affirms our belief.

## 
##  Pearson's product-moment correlation
## 
## data:  density and fixed.acidity
## t = 35.877, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6399847 0.6943302
## sample estimates:
##       cor 
## 0.6680473

From the graph as well as from Pearson’s correlation coefficient, we can say that density and fixed acidity are positively correlated.


Free vs Total sulfur Dioxide

These two variables tell us about the concentration of sulfur dioxide in the wine, either free or total. So even from the definition of the two variables, we can postulate that they must be correlated. Let’s test this hypothesis.

## 
##  Pearson's product-moment correlation
## 
## data:  free.sulfur.dioxide and total.sulfur.dioxide
## t = 35.84, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.6395786 0.6939740
## sample estimates:
##       cor 
## 0.6676665

From the above analysis, it seems that they both are positively correlated.

## 
##  Pearson's product-moment correlation
## 
## data:  quality.num and alcohol
## t = 21.639, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.4373540 0.5132081
## sample estimates:
##       cor 
## 0.4761663
## 
##  Pearson's product-moment correlation
## 
## data:  quality.num and fixed.acidity
## t = 4.996, df = 1597, p-value = 6.496e-07
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.07548957 0.17202667
## sample estimates:
##       cor 
## 0.1240516

Density vs Alcohol

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol
## t = -22.838, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.5322547 -0.4583061
## sample estimates:
##        cor 
## -0.4961798

We can see that there is a moderate negative correlation (about -0.5) between density and alcohol. The smooth line does pass through most of the important places.


Bivariate Analysis

This section lists the analysis of the bi-variate explorations.

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

The alcohol content in the wine correlates with the quality of the wine. With the increase in the quality of the wine, the average (mean and median) of the wine’s alcohol content increases.

Volatile acidity correlates mildly (negatively) with the quality of the wine.

With the increase in the quality of the wine, the median as well as the mean of volatile acidity decreases.

The content of citric acid (mean as well as median) increases with the increase in the quality of the wine.

The median as well as the mean of quantity of the sulphate increases with the increase in the quality of the wine.

The pH value of all wines remains in the range of 3 to 4. In particular, across quality levels, the pH value of the wine stays within 3.2 to 3.4 more than 50 percent of the time.

There doesn’t seem to be any kind of relation between residual sugar and the quality of the wine, but it must be noted that for more than 50 percent of time, the residual sugar was within 2-3 units.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Fixed acidity and citric acid tend to correlate with each other. Since citric acid is a non-volatile acid and fixed acidity gives the total non-volatile acid content, this relationship makes sense.

Fixed acidity and pH are negatively correlated. If the fixed acidity of the wine increases, its pH value decreases. This makes sense: with more non-volatile acid in the wine, its acidity increases and hence its pH decreases. The lower the pH, the more acidic the solution.

Density and fixed acidity are positively correlated (r ≈ 0.67). As the fixed acidity of the wine increases, its density tends to increase.

What was the strongest relationship you found?

Our feature of interest, quality, has its strongest relationship with alcohol content. The quality of the wine is positively correlated with its alcohol content (r ≈ 0.48). This is the strongest relation involving quality that we found.

Multivariate Plots Section

In this section we will examine multiple variables at a time.

Density vs Fixed Acidity by Quality

It can be clearly seen that the density of the wine increases with the increase in the fixed acidity of the wine. From above, we can see that the trend is followed irrespective of the wine quality.


Citric Acid vs Fixed Acidity by Quality

We can see that in all quality groups, the citric acid content tends to increase with the fixed acidity.


Fixed Acidity vs pH by Quality

We can see that for wines of better quality the slope of the regression line remains almost constant, but for wines of the worst quality the slope changes very frequently as the fixed acidity increases.


Density vs Alcohol by Quality

It can be clearly seen that for wines of the worst quality, the slope of the regression line tends to remain constant throughout the graph. In contrast, the regression line for wines of better quality tends to wobble as the alcohol concentration increases. Generally, as the alcohol concentration increases, the density decreases.


Free sulfur dioxide vs Total Sulfur Dioxide by Quality

We can see that for lower values of total sulfur dioxide there is very little variance in the value of free sulfur dioxide. As the total sulfur dioxide increases, so does the variation in the amount of free sulfur dioxide. With the help of the quality groups we can see that the change is very frequent for wines in the normal category and less frequent in the other groups.


Density vs Alcohol over Fixed Acidity by Quality

The density of the wine is strongly correlated with its alcohol content and its fixed acidity. We can see that for all qualities of wine, the smooth line tends to pass through the middle of the graph, suggesting correlation.

So as the ratio of alcohol to fixed acidity increases, the density tends to decrease.


Linear model for density

Here we will create linear model with density, alcohol and fixed acidity.

## 
## Call:
## lm(formula = I(density) ~ I(alcohol) + I(fixed.acidity), data = redwine)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0046132 -0.0006805 -0.0001247  0.0006388  0.0045998 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       9.994e-01  3.116e-04 3207.75   <2e-16 ***
## I(alcohol)       -8.089e-04  2.612e-05  -30.96   <2e-16 ***
## I(fixed.acidity)  6.936e-04  1.599e-05   43.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.001111 on 1596 degrees of freedom
## Multiple R-squared:  0.6541, Adjusted R-squared:  0.6537 
## F-statistic:  1509 on 2 and 1596 DF,  p-value: < 2.2e-16

So it turns out that 65 percent of variance in density is explained by alcohol content and the fixed acidity of the wine.
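To see what the fitted coefficients mean in practice, we can plug one wine into the model equation by hand. The sketch below uses the rounded coefficients from the summary above together with the alcohol and fixed acidity of the first wine in the data listing (9.4 and 7.4); the helper predict_density is a hypothetical name of ours, not part of the original analysis.

```r
# Rounded coefficients copied from the model summary above
coefs <- c(intercept = 9.994e-01, alcohol = -8.089e-04, fixed.acidity = 6.936e-04)

# Hypothetical helper: evaluates the fitted line for given inputs
predict_density <- function(alcohol, fixed.acidity) {
  coefs[["intercept"]] +
    coefs[["alcohol"]] * alcohol +
    coefs[["fixed.acidity"]] * fixed.acidity
}

# First wine in the data listing: alcohol 9.4, fixed acidity 7.4
first_wine <- predict_density(alcohol = 9.4, fixed.acidity = 7.4)

print(round(first_wine, 4))
```

The prediction of about 0.9969 is close to the 0.998 listed for that wine, and the gap is the same order of magnitude as the residual standard error of 0.001111. The signs also match the earlier plots: more alcohol lowers the predicted density, more fixed acidity raises it.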


Quality Linear Model

Now lets try to make a linear model for the quality of the wine.

One thing to check before making the linear model is that the variables in the model are not strongly correlated with each other. Correlated predictors create ambiguity in deciding which one is responsible for a change in the model’s output.

From the correlation matrix we know that the following variables, which are the ones used in the model below, are not strongly correlated with each other:

  • Alcohol
  • pH
  • Sulphates
  • Residual Sugar
  • Chlorides
  • Total Sulfur Dioxide
## 
## Call:
## lm(formula = I(quality.num) ~ alcohol + sulphates + pH + residual.sugar + 
##     chlorides + total.sulfur.dioxide, data = redwine)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.60979 -0.38487 -0.05857  0.47000  1.95671 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.3795842  0.4144106  10.568  < 2e-16 ***
## alcohol               0.3206922  0.0172037  18.641  < 2e-16 ***
## sulphates             1.1811214  0.1101203  10.726  < 2e-16 ***
## pH                   -0.7588198  0.1156594  -6.561 7.22e-11 ***
## residual.sugar        0.0081097  0.0122741   0.661    0.509    
## chlorides            -2.7661785  0.4052075  -6.827 1.23e-11 ***
## total.sulfur.dioxide -0.0027888  0.0005349  -5.213 2.10e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.67 on 1592 degrees of freedom
## Multiple R-squared:  0.3142, Adjusted R-squared:  0.3116 
## F-statistic: 121.6 on 6 and 1592 DF,  p-value: < 2.2e-16

This model tries to explain the variation in the quality of the wine. We get an R-squared value of 0.3142.
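The adjusted R-squared and the F-statistic in the summary can both be derived from the multiple R-squared, the sample size and the number of predictors. This sketch recomputes them with the standard formulas as a sanity check; the variable names are ours, and small differences are expected because the printed R-squared is rounded.

```r
# Values taken from the model summary above
r2 <- 0.3142   # multiple R-squared (rounded as printed)
n  <- 1599     # observations
p  <- 6        # predictors in the model formula

# Residual degrees of freedom (printed as 1592 in the summary)
resid_df <- n - p - 1

# Adjusted R-squared penalizes the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df

# Overall F-statistic on p and resid_df degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / resid_df)

print(round(adj_r2, 4))
print(round(f_stat, 1))
```

Both values match the summary (0.3116 and 121.6) up to rounding.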


Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

There seems to be a strong relationship between Density and the Alcohol and the fixed acidity.

Were there any interesting or surprising interactions between features?

There is a surprising interaction between density and the combination of alcohol and fixed acidity. It may have something to do with the chemistry of fluids, but it is nonetheless an interesting relation that one can find without prior knowledge of the chemistry behind this interaction.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

I created a model for the quality of the wine using some of the physicochemical inputs provided with the dataset. This model contains six of the original 11 inputs in the wine data set. None of them is strongly correlated with the others. This is good, as it avoids ambiguity about which variable is moving the model.

However, their combination explains only about 31 percent of the variance in the quality of the wines.


Final Plots and Summary

Plot One

## 
##  Pearson's product-moment correlation
## 
## data:  density and alcohol/fixed.acidity
## t = -49.207, df = 1597, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7950235 -0.7560023
## sample estimates:
##        cor 
## -0.7762553

Description One

Here we see that the density of the wine is strongly correlated with its alcohol content and its fixed acidity. We can see that as the ratio of alcohol to fixed acidity increases, the density tends to decrease. We can see a very high concentration of the points

Plot Two

Description Two

The above plot is a boxplot of alcohol by the quality groups. It clearly suggests that wines of better quality tend to have a higher concentration of alcohol. This is also significant because, of all the variables, alcohol has the strongest relationship with wine quality.


Plot Three

Description Three

We all know that pH and acid content are related; we are taught this in middle school. A liquid is consumable only at a particular level of acidity. This graph helps us showcase that inherent relation between the pH value of the wine and its fixed acidity. With a correlation this strong (r ≈ -0.68), we can say that the wine data shows the same relation between pH and acidity that is normally accepted.


Reflection

The red wine data set contains 11 physicochemical variables. I explored all of these variables against the output, i.e. the quality of the wine. While some variables, such as alcohol and sulphates, tend to have an effect on the quality of the wine, others do not.

From the exploration it seems that wines with more alcohol tend to have higher quality. The same goes for sulphates.

The relationship between the density of the wine and the combination of alcohol concentration and fixed acidity came as a surprise to me. It wasn’t expected. This is what you get when you explore the data. For a chemist this may seem like a well-known relation, but for someone with very limited knowledge of the chemistry of fluids it is a revelation.

The main problem I faced is the limited amount of data. The fact that we have only 10, 53 and 18 wines with qualities 3, 4 and 8 respectively is a cause for concern. Limited data does not help.

While I have explored the data set to the best of my abilities, I still feel that we could explore the data further by taking more complex combinations of alcohol and sulphates at a time. These two variables have shown a stronger correlation with the quality of the wine than any other variables.

This exploration will be of limited value if we can’t increase the data available for the lower as well as the higher qualities of wine. To understand what parameters really make a wine bad or good, we will need access to more data.